A Experiment Details

Neural Information Processing Systems

Source code for the training pipeline, tasks, and models used in this work is available as part of the supplementary material. We used the Adam [48] optimizer for all our experiments, with a learning rate of 0.001 and a batch size of 128. For solving the differential equations, both during ground-truth data generation and with the neural ODEs, we use the Tsitouras 5/4 Runge-Kutta (Tsit5) method from DifferentialEquations.jl [36].

A.1 Coupled Pendulum

The coupled pendulum dynamics are defined as follows. We train the MP-NODE on a dataset of 500 trajectories, each randomly initialized with state values in [−π/2, π/2] for the angles θ and [−1, 1] for the angular velocities θ̇, with a time step of 0.1 s and each trajectory 10 s long. The dataset is normalized through Z-score normalization.
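The Z-score normalization of the trajectory dataset can be sketched as follows. This is an illustrative NumPy version, not the supplementary code itself; computing the statistics per state dimension over the whole training set is an assumption, as is the 4-dimensional state layout in the example.

```python
import numpy as np

def zscore_normalize(trajectories):
    """Z-score normalize a dataset of trajectories.

    trajectories: array of shape (num_traj, num_steps, state_dim),
    e.g. (500, 101, 4) for 10 s of 0.1 s steps of a two-pendulum state.
    Returns the normalized data plus the (mean, std) statistics needed
    to de-normalize model predictions later.
    """
    mean = trajectories.mean(axis=(0, 1), keepdims=True)
    std = trajectories.std(axis=(0, 1), keepdims=True)
    std = np.where(std == 0.0, 1.0, std)  # guard against constant dimensions
    return (trajectories - mean) / std, (mean, std)

# Toy usage with random data standing in for simulated pendulum trajectories.
rng = np.random.default_rng(0)
data = rng.uniform(-np.pi / 2, np.pi / 2, size=(500, 101, 4))
normed, (mean, std) = zscore_normalize(data)
```

Keeping the (mean, std) pair around matters in practice: predictions made by the model in normalized coordinates must be mapped back through `normed * std + mean` before comparing against raw ground-truth states.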



Cross-lingual Retrieval for Iterative Self-Supervised Training (supplementary materials)

1 Experiment details

Neural Information Processing Systems

Because of the file size limit, we will release the source code and pretrained checkpoints after the anonymity period. To enable a fair comparison, we followed the same preprocessing steps as described in [13]. In each iteration, we mine all 90 language pairs in parallel, using 8 GPUs for each pair, with each pair taking about 15-30 hours to finish. We lightly tune the margin score threshold using validation BLEU (using threshold scores between 1.04 and 1.07). For all experiments, we use a Transformer with 12 encoder layers and 12 decoder layers, a model dimension of 1024, and 16 attention heads (680M parameters).1 We trained for a maximum of 20,000 steps using label-smoothed cross-entropy loss with 0.2 label smoothing, 0.3
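The margin score threshold above refers to margin-based scoring of mined candidate pairs over their nearest neighbors. A minimal NumPy sketch of the ratio-margin criterion is below; the function name, the toy embeddings, and the choice of neighbor count are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def margin_score(x, y, x_nn_cos, y_nn_cos):
    """Ratio-margin score for a candidate bitext pair (x, y).

    x, y: L2-normalized sentence embeddings (1-D arrays).
    x_nn_cos / y_nn_cos: cosine similarities of x (resp. y) to its
    k nearest neighbors in the other language's embedding space.
    A pair is kept when this score exceeds a tuned threshold
    (e.g. in the 1.04-1.07 range used above).
    """
    cos_xy = float(np.dot(x, y))
    denom = (float(np.mean(x_nn_cos)) + float(np.mean(y_nn_cos))) / 2.0
    return cos_xy / denom

# Toy usage: a close pair scored against moderately similar neighbors.
x = np.array([1.0, 0.0])
y = np.array([0.96, 0.28])  # unit-norm vector with cos(x, y) = 0.96
score = margin_score(x, y, x_nn_cos=[0.9, 0.8], y_nn_cos=[0.85, 0.85])
```

Dividing by the average neighborhood similarity, rather than thresholding the raw cosine, penalizes "hub" sentences that are close to everything, which is why the useful thresholds sit just above 1.0.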